chore(deps): update dependency torch to v2.12.0 #245
Open
dreadnode-renovate-bot[bot] wants to merge 1 commit into
| datasource | package | from | to |
| ---------- | ------- | ------ | ------ |
| pypi | torch | 2.11.0 | 2.12.0 |
This PR contains the following updates:
| Package | Change | Age | Confidence |
| ------- | ------ | --- | ---------- |
| torch | `==2.11.0` → `==2.12.0` | | |

Warning: Some dependencies could not be looked up. Check the Dependency Dashboard for more information.

Generated Summary:

- dyana/loaders/automodel/requirements.txt
- dyana/loaders/base/dyana-requirements-gpu.txt
- dyana/loaders/lora/requirements.txt

This summary was generated with ❤️ by rigging
Release Notes
pytorch/pytorch (torch)
v2.12.0: PyTorch 2.12.0 Release (Compare Source)
PyTorch 2.12.0 Release Notes
Highlights
`fused=True`, joining Adam, AdamW, and SGD with a single-kernel optimizer implementation.

For more details about these highlighted features, you can look at the release blog post. Below are the full release notes for this release.
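As a quick illustration of the fused single-kernel optimizer path highlighted above, here is a minimal sketch using `fused=True` with `AdamW` (assumes a CUDA device; the model and shapes are placeholders):

```python
# Minimal sketch: single-kernel fused optimizer via fused=True (Adam, AdamW, and SGD).
# Assumes a CUDA device is available.
import torch

model = torch.nn.Linear(128, 128).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, fused=True)

loss = model(torch.randn(32, 128, device="cuda")).sum()
loss.backward()
optimizer.step()       # parameters in the group are updated by a fused kernel
optimizer.zero_grad()
```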
Backwards Incompatible Changes
Build Frontend
Strengthened SVE compile checks in `FindARM.cmake`, which may reject previously accepted but incorrect SVE configurations (#176646)

Source builds that enable SVE now validate the compiler configuration more strictly. If a build previously passed with an incomplete or mismatched SVE setup, it may now fail during CMake configuration instead of later in compilation. Update the compiler/toolchain flags so they accurately describe the target SVE support, or disable SVE for that build.
Updated the minimum CUDA version required to build PyTorch from source to CUDA 12.6 (#178925)
Building PyTorch from source with CUDA versions older than 12.6 is no longer supported. Users building custom binaries should install CUDA 12.6 or newer and make sure `CUDA_HOME` points to that installation.
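As a sanity check after upgrading the toolkit, you can confirm which CUDA installation PyTorch's build and extension tooling resolves; this is a small sketch, and `torch.utils.cpp_extension.CUDA_HOME` is simply the existing helper used for illustration, not something introduced by this change:

```python
# Sketch: print the CUDA toolkit that PyTorch's extension tooling resolves.
# CUDA_HOME is read from the CUDA_HOME/CUDA_PATH environment variables or a default install path.
from torch.utils.cpp_extension import CUDA_HOME

print(CUDA_HOME)  # should point at a CUDA 12.6+ installation when building from source
```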
Enforced a C++20 minimum in CMake build files (#178662)
Source builds now require a compiler and build configuration that support C++20. If you maintain custom build scripts or downstream extensions that build PyTorch from source, update the compiler and remove assumptions that PyTorch can be built as C++17.
Distributed
`torch.distributed.nn.functional` ops now raise `RuntimeError` under `torch.compile` (#177342)

All ops in `torch.distributed.nn.functional` (e.g., `broadcast`, `all_reduce`, `all_gather`, `reduce_scatter`, `all_to_all_single`) now raise `RuntimeError` when called inside `torch.compile`. Users should migrate to the functional collectives API in `torch.distributed._functional_collectives`.
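A minimal migration sketch follows; it assumes a single-process `gloo` group purely so the snippet is self-contained (replace with your real process-group setup). The point is swapping the `torch.distributed.nn.functional` call for its functional-collectives counterpart inside compiled code:

```python
# Sketch: migrate an all_reduce used under torch.compile to the functional collectives API.
# A trivial single-rank gloo group is created only so the example runs standalone.
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol

dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29501", rank=0, world_size=1)

@torch.compile
def step(x: torch.Tensor) -> torch.Tensor:
    # Before: torch.distributed.nn.functional.all_reduce(x)  -> now raises RuntimeError under compile
    # After: traceable functional collective
    return funcol.all_reduce(x, "sum", group=dist.group.WORLD)

print(step(torch.ones(4)))
dist.destroy_process_group()
```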
TorchElastic
`torchrun` now defaults to an OS-assigned free port for single-node training instead of port 29500 (#175699)

When running `torchrun --nproc-per-node=N script.py` without specifying `--master-port` or `--standalone`, the default behavior now automatically uses an OS-assigned free port via the `c10d` rendezvous backend. This eliminates "Address already in use" errors when running multiple training jobs concurrently. Multi-node training, explicit `--master-port`, the `PET_MASTER_PORT` env var, and `--standalone` are unchanged.

Version 2.11:

```
# Used static rendezvous on port 29500 by default
torchrun --nproc-per-node=4 train.py
```
MPS
All MPS tensors are now allocated in unified memory (#175818)
Previously, MPS tensors could be allocated in either device-only or unified memory. Now all MPS tensors use unified memory unconditionally. This simplifies memory management and enables CPU access to MPS tensor data without explicit copies. Code that relied on device-only memory placement may observe different performance characteristics.
Inductor
The `max_autotune` layout-constraint deferral introduced in 2.11 is now opt-in (#175330)

In 2.11, Inductor deferred layout freezing for `max_autotune` templates to expose more fusion opportunities. This caused a regional-inductor failure mode, so the default in 2.12 reverts to immediate layout freezing. Users who relied on the deferred behavior for fusion opportunities should opt in explicitly via `torch._inductor.config.max_autotune_defer_layout_freezing` or `TORCHINDUCTOR_MAX_AUTOTUNE_DEFER_LAYOUT_FREEZING=1`.
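A minimal sketch of the opt-in, using the config knob and environment variable named in the note above (treat both names as specific to 2.12):

```python
# Sketch: opt back in to the 2.11 deferred layout freezing for max_autotune templates.
# The config attribute and env var names are taken from the release note above.
import os
import torch
import torch._inductor.config as inductor_config

# Either export the environment variable before compilation...
os.environ["TORCHINDUCTOR_MAX_AUTOTUNE_DEFER_LAYOUT_FREEZING"] = "1"
# ...or flip the Inductor config directly.
inductor_config.max_autotune_defer_layout_freezing = True

compiled = torch.compile(torch.nn.Linear(64, 64), mode="max-autotune")
```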
Deprecations
Release Engineering
Deprecate CUDA 12.8 builds in favor of CUDA 13.0 (#179072)
CUDA 12.8 binaries have been removed from the PyTorch binary build matrix. CUDA 13.0 is now the stable default, and CUDA 12.6 remains available for users on older drivers. Users explicitly pinning the `cu128` index URL will need to switch to `cu130` (recommended) or `cu126`.
Compatibility with CMake < 3.10 will be removed in a future release (#166259)
Source builds against CMake versions older than 3.10 now emit a deprecation warning. A future release will require CMake 3.10 or newer; please upgrade CMake before then.
Linear Algebra
Several CUDA linear algebra operators no longer use the MAGMA backend and now dispatch to cuSolver or cuBLAS unconditionally:
- `torch.linalg.eigh` now dispatches to cuSolver (#174619)
- `torch.linalg.lu_solve` now dispatches to cuSolver/cuBLAS (#174248)
- `torch.linalg.cholesky_inverse` now dispatches to cuSolver (#174681)
- `torch.linalg.cholesky_solve` now dispatches to cuSolver (#174769)

User code calling these APIs does not need to change. The practical impact is for users who depended on MAGMA-specific numerical behavior, performance characteristics, or debugging. Those calls now use the cuSolver/cuBLAS implementations on CUDA.
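Call sites stay the same; a small sketch (assuming a CUDA device) of one affected op, where only the backend dispatch changes:

```python
# Sketch: torch.linalg.eigh on CUDA now runs via cuSolver instead of MAGMA;
# the Python API and return values are unchanged. Assumes a CUDA device.
import torch

a = torch.randn(64, 64, device="cuda", dtype=torch.float64)
a = a @ a.mT + 64 * torch.eye(64, device="cuda", dtype=torch.float64)  # symmetric positive definite
eigenvalues, eigenvectors = torch.linalg.eigh(a)
```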
FullyShardedDataParallel2 (FSDP2)
Compiling through FSDP2 hooks without graph breaks is no longer supported (#174863, #174906). If you use compiled autograd with FSDP2, update your code to allow graph breaks around FSDP2 hooks or disable compiled autograd for the FSDP2 training step.
Profiler
Profiler's `metadata_json` field is now deprecated; use `event_metadata` instead (#179417)
Dynamo
`torch.compile(fullgraph=True)` now warns when a call runs no compiled code; will error in 2.13 (#181940)

Previously `fullgraph=True` was only validated once Dynamo actually compiled and ran the function. If Dynamo was bypassed at call time (e.g. under a user-defined `TorchDispatchMode`), the annotation silently had no effect. 2.12 emits a warning; 2.13 will raise. For graph-break errors without `fullgraph`'s stronger guarantees, use `torch._dynamo.error_on_graph_break`.
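For context, a minimal sketch of the annotation in question; in 2.12 a call that ends up running no compiled code warns instead of passing silently:

```python
# Sketch: fullgraph=True asks Dynamo to capture the whole function as one graph.
# In 2.12 a call that bypasses the compiled artifact emits a warning (an error in 2.13).
import torch

@torch.compile(fullgraph=True)
def f(x: torch.Tensor) -> torch.Tensor:
    return x.sin() + x.cos()

print(f(torch.randn(8)))
```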
The `inline_inbuilt_nn_modules` Dynamo config is deprecated (#177489, #178205)

Inlining of in-built `nn.Module` instances is now the default; setting the flag emits a deprecation warning and it will be removed in a future release.
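A small sketch of the deprecated knob; the `torch._dynamo.config` attribute below follows the flag name given in the note:

```python
# Sketch: touching the deprecated flag now emits a deprecation warning;
# inlining of built-in nn.Module instances is already the default behavior.
import torch._dynamo.config as dynamo_config

dynamo_config.inline_inbuilt_nn_modules = True  # deprecated; will be removed in a future release
```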
Added a deprecation framework to the `torch.compile` config module so individual options can be marked deprecated (#169837)

New Features
Release Engineering
Python Frontend
- `torch.accelerator.Graph` as a unified frontend Graph interface (#171285)

Foreach
- `_foreach_clone` operator, with a fast path for CUDA utilizing `_foreach_copy_` (#177421)

Distributed
- `Store::barrier` API and TCPStore client `BARRIER` support, reducing synchronization round trips compared to the existing `ADD` + `WAIT` pattern (#174920)
- `suspend()`, `resume()`, and `memory_stats()` APIs for managing communicator memory lifecycle (#176300)
- `all_to_all` support in the Gloo backend (#165435)
- `reduce_scatter_offset` to symmetric memory, supporting variable-sized block reductions with NVLink multicast or LSA fallback (#177791)
- `batch_isend_irecv` to work under `torch.compile` (#161213)
- `torch.distributed.symmetric_memory.is_symm_mem_tensor()` API to check if a tensor is a symmetric memory tensor (#178947)
- `NanCheck` to a standalone op (`torch.ops.c10d.check_for_nan`) usable outside of `ProcessGroupNCCL` (#174990)

DTensor
- `grad_placements` parameter to `DTensor.from_local()`, allowing explicit control over gradient placements in the backward pass (#175867)

FullyShardedDataParallel2 (FSDP2)
- `fully_shard` with DTensors on a full SPMD mesh via `DataParallelMeshDims` (#176334)

TorchElastic
- `--shutdown-timeout` to `torchrun` for controlling the SIGTERM-to-SIGKILL timeout during worker shutdown (#172596)

CPU x86
- `CPUBlas` brgemm API for fp8 (e4m3 & e5m2) GEMM, backed by oneDNN (#172548)

CUDA
- `torch.cond` with CUDA graphs, using conditional graph nodes (CUDA 12.4+) so data-dependent control flow can be captured entirely inside a single CUDA graph. Works with the `eager` and `cudagraphs` `torch.compile` backends (no Inductor support yet). (#168912)

MPS
- `linalg_qr` for MPS (#172536)
- `cholesky_solve` support on MPS (#176703)
- `index_reduce` on MPS (#174936)
- `torch.distributions.Gamma` (forward + backward) on MPS (#179228)
- `mvlgamma` on MPS (#178914)
- `nonzero_static` implementation on MPS (#179589) (from miscategorized)

ROCm
XPU
- `torch.accelerator.Graph` on XPU (#176421)
- `memory_clock_rate` and `memory_bus_width` to XPU device properties (#171967)
- `split_group` API when TorchComms is used as a backend for TorchTitan on XPU (#178236)

Profiler
Dynamo
- `torch._dynamo.aot_compile` public, with `aot_eager` and `inductor` backend support and docs (#179917, #180008)
- `recompile_limit` keyword argument to `torch.compile` to override the per-function recompile cap without touching global config (#177936)
- `torch._dynamo.mark_unbacked` for communicating value ranges to the symbolic shape system (#176313)
- A `bdb`-based, `pdb`-style debugger for stepping through nested frames during Dynamo tracing (`n`, `u`, `d`, `r`, `bt`), plus a user-callable `breakpoint()` that auto-starts it (#174626, #174746, #175200)

Inductor
- User streams in `torch.compile`: Inductor now codegens stream context managers (enter/exit) and `record_stream` calls in the wrapper, enabling user streams to flow through compiled regions with proper synchronization, scheduler integration, and cross-stream dependency tracking (#165390, #165391, #165504, #165505, #174223, #176700, #177694)
- `ao::offload`, `ao::reload`, and `ao::wait` ops for asynchronous activation offloading. These ops encapsulate async CPU offloading stream management following the same async 2-op pattern as c10d functional collectives, reducing IR size from 7 nodes (offload) and 5 nodes (reload) down to 2 nodes each (#177621)
- Epilogues (e.g. `relu()`), parsing the user kernel source via AST and inlining the epilogue into the `tl.store` expression (#173662)
- `.out` overloads: Inductor automatically lowers single-output and multi-output functional ops to their `.out` variants as `ExternKernelOut`, enabling memory planner buffer reuse (#175116, #176117)
- `max_autotune` now extends to combo kernels. The autotuning pipeline generates and benchmarks per-sub-kernel block-size phase configs, with chained sequential autotuning and per-sub-kernel reduction hints (#177715, #178936, #179317)
- `mm` and `addmm` for max-autotune, enabling persistent kernels on hardware without TMA (#177781, #179095)
- `torch.float8_e5m2` dtype, including registration for FP8 GEMM autotuning (#171176)
- `max-autotune-gemm`, allowing CUTLASS-style GEMM templates to target Intel GPUs (#161938, #161939)
- `sort`, `median`, and `mode` operations (#178525)
- `conv1d` (#175280)
- `at::vec::convert` for the Inductor C++ x86 backend (#172309)
- `disable_welford_reduction` config flag to opt out of Welford reduction in codegen (#175778)

Ahead-Of-Time Inductor (AOTI)
- `float8_e8m0fnu` and `float4_e2m1fn_x2` in the AOTInductor C shim layer, enabling MXFP4 quantization (e.g., for AMD MI350) (#176496)

torch.fx
- `tuple_return` option to `split_module` that wraps submodule outputs in a tuple (#179007)
- `ignore_raw_node` option to `GraphPickler` (#176939)
- `_merge_overlapping_fusions()` method to `FxNetSplitter` which detects and merges overlapping fusion groups (#177099)

torch.export
- `float8_e8m0fnu` dtype (#176270)
- `torch.uint32` and `torch.uint64` dtypes (#179434)
- `List[List[float]]` (#178081)

JIT
Improvements
Release Engineering
Python Frontend
- `uniform` and `normal` sampling on CPU to improve fp16/bf16 results (#175988)
- `requires_grad` to `Optional[bool]` in `torch.asarray` (#170897)

Autograd
- `narrow_copy` derivative (#175609)
- `grid_sample` (#177487)
- `torch.aminmax` (#175215)
- `num_splits` in varlen attention to allow disabling split_kv (#176905)
- `AutoNamingMode` support in Selective Activation Checkpointing (#175348)
- `torch.utils.checkpoint` to no longer use `autograd.Function` for saving inputs (#174327)

Dataloader
Linear Algebra
- `_int_mm` unsigned int8 × signed int8 (u8s8) support on CPU (#168226)

Nested Tensor (NJT)
torch.nn
- `bias` argument to `nn` normalization methods (`LayerNorm`, `GroupNorm`, `RMSNorm`, etc.) (#176573)
- `MultiMarginLoss` error message for inconsistent target size (#174072)
- `enable_gqa` flag to `varlen_attn` (#179468)
- `eps=0` in `batch_norm` during eval mode (#175508)
- `trunc_normal_` initialization (#176240)

Sparse
- `clone` operator for semi-structured sparse tensors (#174991)
- `alg_id` (#178659)

Build Frontend
C++ Frontend
- `cpp_extension` and `cpp_builder` to C++20 (#176659)
- `at::Tag` header-only changes and add a `library.def` override for tags (#181608)

Distributed
- `timeout` parameter to `torch.distributed.barrier()` (#174974)
- `reduce_scatter_tensor_coalesced` support to `ProcessGroupWrapper` (#168961)
- `batched_grad_copy` option to reduce per-parameter kernel launches to 2 kernels per bucket (#176638)
- `BucketCapacityConfig` dataclass (#175217)
- `ChildFailedError` exitcode output for better debugging (#175254)
- `dist.broadcast` for FP8 tensors on GPUs older than SM90 (#175884)
- `__torch_function__` handlers for distributed functions (#176376)
- `split_group` API for TorchComms on XPU (#178236)
- `ncclx` and `gloo` to FlightRecorder trace analyzer backend allowlist (#180268)
- Implement missing methods in `ProcessGroupWrapper` (#178779)

Distributed Checkpoint (DCP)
DTensor
- `_StridedSharding` for full `nn.Linear(DTensor)` compatibility (#166483)
- `is_pinned()` support (#177235)
- `print()` HOP support (#175222)
- `run_dtensor_rng_op` compatible with `compile_on_one_rank` (#177447)
- `_StridedShard` through `Replicate` (#179059)
- `Split(Flatten)` sharding propagation (#179632)
- `view_groups` (#174629)
- `index_select`, `index`, `index_fill`, `index_reduce`, `roll`, `fft`, `constant_pad_nd`, `squeeze.dims`, `interpolate`, `linalg` ops, `LayerNorm`/`RMSNorm` FW/BW, `foreach`/`fused` ops, and einsum linearity (#176037, #176038, #178456, #175463, #175656, #173563, #176991, #176955, #179173, #177186, #177187, #176150, #174830)

FullyShardedDataParallel (FSDP)
- `clip_grad_norm` to match the documented behavior (#173641)

FullyShardedDataParallel2 (FSDP2)
- `ModuleList`/`ModuleDict` subclasses that implement `forward()` (#175033)
- `fully_shard` (#173580)
- `shard_mesh` and `shard_mesh_from_root` handling (#174107)

Distributed Pipeline
TorchElastic
CUDA
- `offset_t` operators to be `__host__ __device__` in `SortStable.cu` (#175997)
- `avg_pool3d` backward shape-check variables in CUDA (#178893)
- `per_process_memory_fraction` + `throw_on_cudamalloc_oom` (#179473)
- `enable_annotations` kwarg to `torch.cuda.graph` (#179867)
- `ReduceLogicKernel` (#176132)

cuDNN
MPS
- `abs` complex overflow/underflow on MPS (#174346)
- `index_fill_` to native Metal (#175822)
- `histogram` to float/bfloat types on MPS (#176913)
- `unfold_backward` to `torch.complex64` on MPS (#177274)
- `scatter`, `gather`, `repeat`, `cumsum`, `logcumsumexp`, `cumprod`, and `nn.functional.linear` on MPS (#177794, #178198, #178328, #178411, #178436, #178799)
- `lerp`, `eye`, `relu`, `silu`, `fill_`, `xlogy`, `norm` to native Metal kernels (#177093, #178683, #178866, #179071, #176101, #177749, #177328)
- `DeviceCapability` for MPS backend (#178180)
- `enable_gqa` parameter to SDPA MPS meta registration (#181550)

ROCm
XPU
- `addmv`, `addmm`, and `baddbmm` on XPU (#174590)
- `addcdiv` lowering for XPU (#176163)
- `bmm_outer_product` Triton override for XPU (#180816)
- `IntelGPUError` in Inductor (#169167)

Profiler
Dynamo
- `enum.Enum` iteration, `nn.Module.__getattribute__`, `_enter_autocast`/`_exit_autocast` and other context managers, `next()` on `itertools.count`, `itertools.takewhile`, `bool(OrderedDict)`, `NamedTuple.__eq__(tuple)`, numpy `ndarray.flat`, and `locals()`/`vars()` (#175176, #175527, #173877, #176521, #178818, #177876, #175394, #176729, #175787, #179595)
- `nb_index`/`nb_bool`/`nb_float` slots so Dynamo can trace `operator.index(tensor)`, `bool(...)`, and `float(...)`; graph-break on `torch.Generator` methods (#178921, #178931, #179114, #180198, #178519)
- `cond` supports aliases and mutations under `no_grad`, autogradable leaf modules support pytree outputs, `nonstrict_trace` accepts `nn.Module` inputs, and `invoke_subgraph` supports subgraph reuse (#172836, #172152, #175010, #172372, #176644)
- `torch.cuda.stream`, sync barriers via a dependency HOP, `triton.set_allocator` inside `torch.compile`, and reuse of tracked objects for Triton `prune_configs_by` (#177610, #168894, #177470, #177874)

Inductor
- `OUT_DTYPE`, `ACC_TYPE`, and `INDEX_DTYPE` codegen flow in Triton templates (#179453)
- `addcdiv` lowering for CUDA parity with eager and matching `_foreach_addcdiv` to `_foreach_addcmul` (#174912, #175309, #175310, #175839, #176237)
- `lerp` decompositions for bitwise parity with eager (#176804)
- `torch.cat` and avoided duplicate computation in `cat`/`pad` when inputs have multiple consumers (#175729)
- `ExternKernelOut` for output buffer reuse, and added `symm_mem` planning for graph inputs and fallback regions (#174856, #175449)
- `pad_mm` AutoHeuristics in deterministic mode (#176186, #179826)
- `NotImplementedError` when `return_aux=AuxRequest(max_scores=True)` is requested with `BACKEND='FLASH'` instead of failing later with an opaque error (#177434)
- `allow_tf32` to `fp32_precision` to avoid divergence with the new TF32 API (#176098)
- `prims.scalar_tensor` and `aten.arange.start_step` (#179017, #179028)
- `convert_element_type` lowering to emulate PyTorch eager numerics (#176781)
- `kpack` Triton compile options on ROCm (#173179)
- `aten.index_add` (#179486)
- `tile_k` from nvMatmulHeuristics matching (#176845)

Ahead-Of-Time Inductor (AOTI)
- `aten._grouped_mm` to AOTInductor fallback ops, enabling cpp-wrapper mode for grouped_mm (#177307)
- `AOTIPythonKernelHolder`, allowing a single compiled kernel to serve multiple input shapes (#176018)
- `native_layer_norm`, `aminmax` (#176019)
- `Optional[List[T]]` arguments in cpp wrapper (#174460)
- `_scaled_dot_product_attention_math_for_mps` `enable_gqa` (#181549)

torch.fx
- `get_source_partitioner` to parse `nn_module_stack` metadata for improved source-based graph partitioning (#175788)
- `split_module` now uses `_make_graph_module` to support lazy recompile (#177907)
- `fuser_utils.topo_sort` to produce a stable ordering (#175378)

Composability
- `DynamicInt` `__pow__` and `__rpow__` methods (#179868)
- `scaled_mm_v2` CPU implementation (#176266)

Bug fixes
Release Engineering
Configuration
📅 Schedule: (UTC)
🚦 Automerge: Enabled.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR has been generated by Mend Renovate.